This is a vehicle description dataset whose features were extracted from silhouettes by BINATTS, an extension of HIPS (Hierarchical Image Processing System). BINATTS extracts a combination of scale-independent features, utilising both classical moment-based measures, such as scaled variance, skewness and kurtosis about the major/minor axes, and heuristic measures, such as hollows, circularity, rectangularity and compactness.
Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the cars.
The task is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.
Attribute Information:
Where $\sigma_{maj}^2$ is the variance along the major axis, $\sigma_{min}^2$ is the variance along the minor axis, and area of hollows = area of bounding polygon $-$ area of object.
Let's analyze the dataset and detect multicollinearity, if any. Then apply Principal Component Analysis (PCA) to reduce the dimensionality while covering more than 95% of the variance.
# Utilities
from time import time
# Numerical calculation
import numpy as np
# Data handling
import pandas as pd
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Sample and parameter tuning
from sklearn.model_selection import train_test_split, GridSearchCV
# Predictive Modeling
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
# Feature Engineering
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, accuracy_score, precision_recall_curve
# Configure default settings for the libraries
%matplotlib inline
# sns.set(style='whitegrid', palette='deep', font='sans-serif', font_scale=1.2, color_codes=True)
Comments
%matplotlib inline sets the backend of matplotlib to the 'inline' backend: with this backend, the output of plotting commands is displayed inline, without needing to call plt.show() every time something is plotted.
# Load the dataset into a Pandas dataframe called vehicle
vehicle = pd.read_csv('vehicle.csv')
# Check the head of the dataset
vehicle.head()
# Check the tail of the dataset
vehicle.tail()
Comments
Data pre-processing - Understand the data and treat missing values, outliers
The dataset is divided into two parts, namely, feature matrix and the response vector.
# Get the shape and size of the dataset
vehicle.shape
# Get more info on it
# 1. Name of the columns
# 2. Find the data types of each columns
# 3. Look for any null/missing values
vehicle.info()
Observations
# Find out the null value counts in each column
vehicle.isnull().sum()
# Check for any Non-Real value present in the dataset such as '?' or '*' etc.
vehicle[~vehicle.iloc[:,:-1].applymap(np.isreal).all(1)]
Comments
np.isreal is a NumPy function that checks whether each value is a real number and returns a boolean; .applymap is a pandas DataFrame method that applies np.isreal element-wise.
Observation
# Look at the individual distribution histogram
_ = vehicle.hist(figsize=(20,20), bins=20, color='turquoise')
As there are 3 different vehicle types present in the dataset, we will be replacing the NaN values in each feature with the median from their respective class types.
# Split the dataset according to their class types
unique_vehicles = [vehicle[vehicle['class'] == veh] for veh in vehicle['class'].unique()]
# Replaces the NULLs with the median of the respective feature
for unique_veh in unique_vehicles:
    for col in unique_veh.columns[:-1]:
        median = unique_veh[col].median()
        unique_veh[col] = unique_veh[col].fillna(median)
# Join the split datasets back together and sort the index
vehicle = pd.concat(unique_vehicles).sort_index()
# Check the dataset after NULL treatment
vehicle.isnull().sum()
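As a side note, the class-wise median imputation above can also be written in a single pass with groupby/transform, avoiding the split/concat round trip. A minimal sketch (the helper name is ours, for illustration):

```python
import numpy as np
import pandas as pd

def impute_class_medians(df, target='class'):
    """Fill NaNs in each numeric column with the median of that column
    within the row's own target class."""
    out = df.copy()
    num_cols = [c for c in out.columns if c != target]
    out[num_cols] = out.groupby(target)[num_cols].transform(
        lambda s: s.fillna(s.median()))
    return out
</imports>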
# Describe the dataset with various other summary and statistics
vehicle.describe().T
# Plot the central tendency of the dataset
_, bp = vehicle.boxplot(return_type='both', figsize=(20,10), rot='vertical')
fliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
caps = [cap.get_ydata() for cap in bp['caps']]
whiskers = [whiskers.get_ydata() for whiskers in bp["whiskers"]]
# Count the number of outlier data points present in each feature
for idx, col in enumerate(vehicle.columns[:-1]):
    print(col, '--', len(fliers[idx]))
Observations:
Min/Max Replacement: Eight features in the dataset contain outliers. Outliers will be replaced by their nearest whisker ends in the box plot: data points below $Q_1 - 1.5 \times IQR$ will be replaced by the lower whisker value, and data points above $Q_3 + 1.5 \times IQR$ by the upper whisker value.
# Treat the outlier data points
for idx, col in enumerate(vehicle.columns[:-1]):
    q1 = vehicle[col].quantile(0.25)
    q3 = vehicle[col].quantile(0.75)
    low = q1 - 1.5*(q3 - q1)
    high = q3 + 1.5*(q3 - q1)
    vehicle.loc[(vehicle[col] < low), col] = caps[idx * 2][0]
    vehicle.loc[(vehicle[col] > high), col] = caps[idx * 2 + 1][0]
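An equivalent, boxplot-independent way to do the same capping is to clip each column to the data values nearest its whisker bounds. A sketch (the helper is hypothetical, not part of the notebook above):

```python
import pandas as pd

def cap_outliers_iqr(s, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the
    nearest data value inside that range (the whisker ends)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    low = s[s >= q1 - k*iqr].min()    # lower whisker value
    high = s[s <= q3 + k*iqr].max()   # upper whisker value
    return s.clip(low, high)
```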
# Check the dataset after Outlier treatment
sns.set_style('darkgrid')
plt.figure(figsize=(30, 30))
index = 1
for col in vehicle.columns[:-1]:
    plt.subplot(1, len(vehicle.columns[:-1]), index)
    sns.boxplot(y=vehicle[col], palette='inferno', fliersize=12)
    index += 1
plt.tight_layout()
Understanding the attributes - Find relationship between different attributes (Independent variables) and choose carefully which all attributes have to be a part of the analysis and why
Bivariate analysis is one of the simplest forms of quantitative analysis; it involves the study of two variables for the purpose of determining the empirical relationship between them.
A pairplot helps visualize the pairwise relationships between variables. It creates a square matrix over the continuous attributes of the dataset. The diagonal plots represent the histogram and/or KDE plot of a particular attribute, whereas the upper and lower triangular plots show the co-linearity between two attributes.
sns.pairplot(vehicle, diag_kind='kde')
# Visualize the correlation among independent features
plt.figure(figsize=(13,13))
sns.heatmap(vehicle.corr(), annot=True, linewidths=0.1, cmap='BrBG')
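The heatmap highlights several strongly correlated feature pairs. Multicollinearity can also be quantified with variance inflation factors, $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on all the others; values above ~10 are usually flagged as highly collinear. A minimal NumPy-only sketch (the helper name is ours, not from any library):

```python
import numpy as np
import pandas as pd

def vif_table(df):
    """Variance inflation factor for every column of a numeric DataFrame."""
    X = df.to_numpy(dtype=float)
    n = len(X)
    vifs = {}
    for j, col in enumerate(df.columns):
        y = X[:, j]
        # design matrix: intercept + all other features
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        r2 = 1.0 - ((y - A @ beta)**2).sum() / ((y - y.mean())**2).sum()
        vifs[col] = 1.0 / max(1.0 - r2, 1e-12)  # guard against a perfect fit
    return pd.Series(vifs, name='VIF').sort_values(ascending=False)
```

For this dataset, `vif_table(vehicle.drop('class', axis=1))` would make the collinearity visible numerically rather than visually.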
Observations:
sns.jointplot(x='skewness_about', y='skewness_about.1', data=vehicle, kind='reg', height=7, color='k')
Observations:
But let's perform Feature Importance analysis to see which features are more important in deciding the target class.
# Divide the dataset into Input features and Target variables
X = vehicle.drop('class', axis=1)
y = vehicle['class']
# Feature Importance plot using Random Forest Classifier
rf = RandomForestClassifier().fit(X, y)
pd.DataFrame(rf.feature_importances_, index=vehicle.columns[:-1], columns=['Importance']) \
    .sort_values('Importance').plot(kind='barh', figsize=(15,7), title='Feature Importance')
Observation:
# Find count of unique target variable
len(y.unique())
# OR
y.nunique()
# What are the different values of the dependent variable
y.unique()
# Find out the value counts in each outcome category
# vehicle.groupby('class').size()
y.value_counts()
# Check the frequency distribution of each target class
fig, axes = plt.subplots(1, 2, figsize=(16,6))
sns.countplot(x=y, ax=axes[0], palette='rocket')
_ = axes[1].pie(y.value_counts(), autopct='%1.1f%%', shadow=True, startangle=90, labels=y.value_counts().index)
Observation
# Compare class wise mean
pd.pivot_table(vehicle, index='class', aggfunc=['mean']).T
Observations:
Use PCA from scikit learn and elbow plot to find out reduced number of dimension (which covers more than 95% of the variance)
The curse of dimensionality is the phenomenon where the feature space becomes increasingly sparse as the number of dimensions of a fixed-size training dataset grows. Analyzing and organizing data in high-dimensional spaces (often with hundreds or thousands of dimensions) is prone to various adverse outcomes. Most machine learning algorithms are very susceptible to overfitting due to the curse of dimensionality.
To overcome such situations, we do feature engineering, where algorithms reduce the number of dimensions. PCA is one such feature extraction technique.
Principal Component Analysis (PCA) uses an orthogonal linear transformation to produce a lower-dimensional representation of the dataset. It finds a sequence of linear combinations of the variables, called the principal components, that explain the maximum variance, summarize the most information in the data, and are mutually uncorrelated with each other.
PCA allows us to quantify the trade-off between the number of features we utilize and the total variance explained. It also lets us determine which features capture similar information and discard them to create a more parsimonious model.
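The "orthogonal linear transformation" can be made concrete in a few lines of NumPy: center the data, take its SVD, and project onto the top right-singular vectors. This is a sketch of the idea, not a replacement for scikit-learn's PCA:

```python
import numpy as np

def pca_via_svd(X, k):
    """Project data onto its top-k principal components.
    Returns the scores and the eigenvalues of the covariance matrix."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / (len(X) - 1)              # variance along each component
    return Xc @ Vt[:k].T, explained
```

Because the projection directions are orthogonal, the resulting components are uncorrelated: their covariance matrix is diagonal, with the explained variances on the diagonal.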
In order to perform PCA we need to do the following:
StandardScaler() standardizes the features so that each column/feature/variable has mean = 0 and standard deviation = 1.
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
X_scaled.describe().T
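As a sanity check, what StandardScaler computes per column is the z-score $z = (x - \mu)/\sigma$, using the population standard deviation (ddof=0). A tiny worked example:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # mean 5, population std 2
z = (x - x.mean()) / x.std()   # np.std defaults to ddof=0, same as StandardScaler
# z now has mean 0 and standard deviation 1
```

(Note that `describe()` reports the sample standard deviation, ddof=1, so it will show values very slightly different from exactly 1.)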
covar_matrix = PCA(n_components=X_scaled.shape[1])
covar_matrix
covar_matrix.fit(X_scaled)
# calculate variance ratios
var = covar_matrix.explained_variance_ratio_
var
# cumulative sum of variance explained with [n] features
eigen_vals = np.round(covar_matrix.explained_variance_ratio_, decimals=3)*100
np.cumsum(eigen_vals)
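The number of components needed to cross the 95% threshold can also be read off programmatically instead of eyeballing the cumulative sum; a sketch (the helper name is ours):

```python
import numpy as np

def components_for_threshold(ratios, threshold=0.95):
    """Smallest k whose cumulative explained-variance ratio reaches the threshold."""
    return int(np.argmax(np.cumsum(ratios) >= threshold) + 1)
```

Here `components_for_threshold(covar_matrix.explained_variance_ratio_)` would return the value we later pass as n_components.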
Observation:
Generic Method to draw a Scree Plot
def generate_scree_plot(covar_matrix, threshold):
    var = covar_matrix.explained_variance_
    eigen_vals = np.cumsum(np.round(covar_matrix.explained_variance_ratio_, decimals=3)*100)
    f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
    f.suptitle('PCA Scree plot')
    ax1.plot(np.arange(1, len(var)+1), var, '-go')
    ax1.set_xticks(np.arange(1, len(var)+1))
    ax1.set_title('Explained Variance')
    ax1.set_xlabel('# of Components')
    ax1.set_ylabel('Eigen Values')
    ax2.plot(np.arange(1, len(eigen_vals)+1), eigen_vals, ':k', marker='o', markerfacecolor='red', markersize=8)
    ax2.set_xticks(np.arange(1, len(eigen_vals)+1))
    ax2.axhline(y=threshold, color='r', linestyle=':', label='Threshold(95%)')
    ax2.legend()
    ax2.plot(np.arange(sum(eigen_vals <= threshold) + 1, len(eigen_vals) + 1),
             [val for val in eigen_vals if val > threshold], '-bo')
    ax2.set_ylim(bottom=threshold-10, top=100)
    ax2.set_xlim(right=11)
    ax2.set_title('Cumulative sum Explained Variance Ratio')
    ax2.set_xlabel('# of Components')
    ax2.set_ylabel('% Variance Explained')
generate_scree_plot(covar_matrix, threshold=95)
plt.figure(figsize=(15,10))
plt.axhline(y=95, color='r', linestyle=':')
plt.bar(np.arange(1, len(eigen_vals) + 1), eigen_vals)
plt.plot(np.arange(1, len(np.cumsum(eigen_vals))+1), np.cumsum(eigen_vals), drawstyle='steps-mid')
plt.yticks(np.arange(0,100,5))
_, _ = plt.xticks(np.arange(1,18,1)), plt.xlabel('# of Components')
plt.xlim(0, 12)
Observations:
# Create a new matrix using the n components
X_projected = PCA(n_components=7).fit_transform(X_scaled)
X_projected.shape
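Instead of fixing n_components=7 by hand, scikit-learn's PCA also accepts a float in (0, 1): it then keeps the smallest number of components whose cumulative explained-variance ratio reaches that fraction. A sketch on synthetic data (variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 10))
# make the last 5 features nearly duplicate the first 5
X_demo[:, 5:] = X_demo[:, :5] + 0.01 * rng.normal(size=(200, 5))

pca95 = PCA(n_components=0.95).fit(X_demo)
# pca95.n_components_ holds the number of components actually kept
```

On the scaled vehicle data, `PCA(n_components=0.95).fit_transform(X_scaled)` would pick the component count automatically.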
# Let's create a generic method to train and test the model
def run_classification(estimator, X_train, X_test, y_train, y_test):
    # train the model
    clf = estimator.fit(X_train, y_train)
    # predict from the classifier
    y_pred = clf.predict(X_test)
    print('*'*100)
    print('Estimator:', clf)
    print('-'*80)
    print('Training accuracy: %.2f%%' % (accuracy_score(y_train, clf.predict(X_train)) * 100))
    print('Testing accuracy: %.2f%%' % (accuracy_score(y_test, y_pred) * 100))
    print('-'*80)
    print('Confusion matrix:\n %s' % (confusion_matrix(y_test, y_pred)))
    print('-'*80)
    print('Classification report:\n %s' % (classification_report(y_test, y_pred)))
    print('*'*100)
Divide both the original and the PCA-projected datasets into an 80:20 train/test split. We will evaluate model performance on both datasets to see the difference.
# Divide the original dataset into train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
# Divide the projected dataset into train and test split
X_projected_train, X_projected_test, y_train, y_test = train_test_split(X_projected, y, test_size=0.2, random_state=1)
X_projected_train.shape, X_projected_test.shape, y_train.shape, y_test.shape
# Run classification for SVC, Gaussian Naive Bayes and Random Forest on the original dataset
models = [SVC(), GaussianNB(), RandomForestClassifier()]
_ = [run_classification(model, X_train, X_test, y_train, y_test) for model in models]
Observations:
# Run classification for SVC, Gaussian Naive Bayes and Random Forest on the PCA-projected dataset
models = [SVC(), GaussianNB(), RandomForestClassifier()]
_ = [run_classification(model, X_projected_train, X_projected_test, y_train, y_test) for model in models]
Observations:
Let's perform the grid search using scikit-learn's GridSearchCV, which stands for grid search cross-validation. By default, GridSearchCV uses KFold or StratifiedKFold depending on the estimator and target (3 folds in older scikit-learn versions, 5 folds since 0.22).
# Run GridSearch to tune the hyper-parameter
st = time()
k_fold_cv = 10 # Stratified 10-fold cross validation
grid_params = [
{ 'C': [0.01, 0.05, 0.5, 1], 'kernel': ['linear', 'rbf'] },
{ 'C': [1, 10, 100, 1000], 'kernel': ['rbf'], 'gamma': [0.001, 0.0001] },
{ 'C': [1, 10, 100, 1000], 'kernel': ['poly'], 'degree': [2,3,4,5,6] }
]
grid = GridSearchCV(SVC(), param_grid=grid_params, cv=k_fold_cv, n_jobs=1)  # note: the 'iid' parameter was removed in scikit-learn 0.24
grid.fit(X_projected_train, y_train)  # tune on the training split only, so the test set stays unseen
print('Best hyper parameter:', grid.best_params_)
print('Time taken %.2fs to tune the best hyper-parameter for Support Vector classifier' % (time()-st))
# Use the tuned estimator from GridSearch to run the classification
run_classification(grid.best_estimator_, X_projected_train, X_projected_test, y_train, y_test)
Multicollinearity and the curse of dimensionality are two major phenomena that adversely impact any machine learning model. With a high degree of multicollinearity, a model tends to miss major information contained in the mathematical space of the input features. And with the curse of dimensionality, because the feature space becomes increasingly sparse as the number of dimensions of a fixed-size training dataset grows, a model tends to overfit.
Principal Component Analysis helps address both problems and improves model performance to a great extent.